Causal Inference I

Chelsea Parlett-Pelleriti

Correlation \(\neq\) Causation

Group Discussion: “Correlation is not Causation”

  • What do people mean when they say this?

  • What’s an example of correlation not being causation?

  • Can you think of any examples where causation isn’t correlation?

  • When can we be more sure that correlation = causation?

Prediction vs. Inference vs. Description

  • Prediction: goal is to get output as close to actual output as possible

  • Description: goal is to describe the data

  • Inference: goal is to learn about relationships between variables

DAGs

  • Directed: edges have a direction from one node to another node; causes

  • Acyclic: there are no cycles that allow something to cause itself/feedback loops

  • Graphs: variables are connected via edges according to their causality

DAGs

DAGs don’t specify parametric assumptions about the relationships such as:

  • how strong the relationships are

  • whether the relationships are linear/non-linear etc.

the DAG structure just shows that we think the relationship exists.

DAGs

from: National Geographic, originally from tylervigen.com

DAGs

DAGs

💡 Do you think this DAG is valid?

Core DAG Paths

Fork: z is a confounder

Chain: z is a mediator

Collider: z is a collider

Forks

  • x is umbrella usage, y is sprinkler behavior, z is rain

  • x is annual income, y is cognitive test score , z is years of education

Forks

set.seed(540)
n <- 10000
years_ed <- rnorm(n, 12,4) %>% round() 
cog_test <- -20 +  years_ed*2.4 + rnorm(n,0,25)
annual_income <- 60000 + years_ed*1500 + rnorm(n,0,15000)

df <- data.frame(years_ed, cog_test, annual_income)

# build reg
lin_reg <- lm(annual_income ~ cog_test,
    data = df) %>% summary()

lin_reg$coef %>% round(4)
              Estimate Std. Error  t value Pr(>|t|)
(Intercept) 77062.1561   167.8604 459.0849        0
cog_test       84.3694     6.0006  14.0601        0

❓ Can we interpret the coefficient of cog_test causally? If we increase your cog_test scores, would we expect you to make more money?

Forks

  • x is annual income, y is cognitive test score , z is years of education

Forks

# build reg
lin_reg <- lm(annual_income ~ cog_test + years_ed,
    data = df) %>% summary()

lin_reg$coef %>% round(4)
              Estimate Std. Error  t value Pr(>|t|)
(Intercept) 59590.8788   489.6401 121.7034   0.0000
cog_test       -0.6094     6.0516  -0.1007   0.9198
years_ed     1523.3561    40.4359  37.6733   0.0000



💡 “controlling for” years_ed gives us a better estimate of the effect of cog_test

Forks

For forks, we want to control for the effect of the confounder to get a causal estimate of y ~ x

Note: We’ll talk about more ways to do this next time.

Chains

  • x is hours studying, y is exam score, z is material comprehension

  • x is diet quality (nutrient density) , y is mood , z is gut microbiome

Chains

set.seed(540)
n <- 100

diet_quality <- rnorm(n)
microbiome <- diet_quality*2.5 + rnorm(n)
mood <- microbiome*2 + rnorm(n, sd = 2)

df <- data.frame(diet_quality, microbiome, mood)

# build reg
lin_reg <- lm(mood ~ diet_quality,
    data = df) %>% summary()

lin_reg$coef %>% round(4)
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   -0.1634     0.2966 -0.5509    0.583
diet_quality   5.1662     0.3146 16.4226    0.000

We expect the total effect to be \(\beta_{dq \to m} * \beta_{m \to mood} = 2.5 * 2 = 5\)

❓ Can we interpret the coefficient of diet_quality causally? If we increase your diet_quality , would we expect you to improve mood?

❓ What if we control for microbiome?

Side Note: Controlling For

When we say we are “controlling for” a variable, we want the effect of x on y after accounting for z. Covariates are one way of controlling for.

In this example: what is the effect of diet_quality if we already account for microbiome.

Chains

# build reg
lin_reg <- lm(mood ~ diet_quality + microbiome,
    data = df) %>% summary()

lin_reg$coef %>% round(4)
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   -0.1730     0.1718 -1.0070   0.3164
diet_quality  -0.7506     0.4609 -1.6286   0.1066
microbiome     2.3592     0.1688 13.9745   0.0000

Chains

For chains we do not want to control for mediators to get an indirect causal estimate of y ~ x

For chains we dowant to control for mediators to get a direct causal estimate of y ~ x

Chains

  • x: caffeine consumption, y: productivity, z sleep quality.

❓ What is the direct effect of caffeine on productivity? What is the indirect effect of caffeine on productivity?

Colliders

  • x is being funny, y is being hot, z is dating me

  • x is extracurriculars , y is good exam scores , z is getting into Chapman

Colliders

set.seed(12938)
n <- 1000
logit <- function(x){
  return(1/(1 + exp(-x)))
}

extra_curriculars <- rnorm(n)
exam_scores <- rnorm(n)
acceptance <- rbinom(n, size = 1, prob = logit(1.6*extra_curriculars + exam_scores*2))

df <- data.frame(extra_curriculars, exam_scores, acceptance)

# build reg
lin_reg <- lm(exam_scores ~ extra_curriculars,
    data = df) %>% summary()

lin_reg$coef %>% round(4)
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        -0.0416     0.0319 -1.3028   0.1930
extra_curriculars   0.0076     0.0317  0.2395   0.8108

❓ Can we interpret the coefficient of extra_curriculars causally? If we increase your extra_curriculars , would we expect you to improve grades?

❓ What if we control for acceptance?

Colliders

# build reg
lin_reg <- lm(exam_scores ~ extra_curriculars + acceptance,
    data = df) %>% summary()

lin_reg$coef %>% round(4)
                  Estimate Std. Error  t value Pr(>|t|)
(Intercept)        -0.6404     0.0387 -16.5301        0
extra_curriculars  -0.2501     0.0290  -8.6205        0
acceptance          1.2399     0.0585  21.2086        0

Colliders

Colliders

For colliders we do not want to condition on the collider in order to get a causal estimate of y ~ x

Collider bias can be a form of selection bias.

Core DAG Paths

Fork: z is a confounder (condition)

Chain: z is a mediator (maybe condition)

Collider: z is a collider (don’t condition)

Open vs. Closed Paths

  • Open: paths that transmit association between x and y (forks, chains)

  • Closed: paths that do not transmit association between x and y (colliders)

Backdoor Paths

  • Open

  • Non-causal

Causation: An Intro

Experimental Design

🦟 Design an experiment to test whether bed nets reduce malaria infections

Experimental Design

🦟 What’s a potential issue with your experiment?

Experimental Design

from: https://www.r-causal.org/chapters/02-whole-game

Why would random assignment to mosquito netting help here?

Causes

You have a headache. Jennifer gives you what looks like a pill. An hour later your headache goes away. How do you reason about whether or not the “pill” causes your headache to go away?

Experiments

  • x: wealth, z : using paxlovid, y : covid severity

Experiments

Which of these paths are causal for our question? non-causal?

Experiments

Experiments

is this path the causal path we’re interested in?

Experiments

if we can get an unbiased causal estimate of y ~ z using either set, what’s an argument for adjusting for x?

Adjustment Set

❓ Are there any paths that connect podcast to exam? Look for forks, colliders, or chains.

Adjustment Set

Adjustment Set

dagitty and ggdag

Installing dagitty and ggdag

# install.packages("dagitty")
# install.packages("ggdag")
library(dagitty)
library(ggdag)

Creating Your First DAG

my_first_dag <- ggdag::dagify(
  x ~ y,
  y ~ z + w
)

ggdag(my_first_dag) + 
  theme_dag()

Creating Your First DAG

my_first_dag <- ggdag::dagify(
  x ~ y,
  y ~ z + w
)

coordinates(my_first_dag) <- list(
  x = c(x = 0.5,y = 0,z = -0.5,w = -0.5),
  y = c(x = 0,y = 0,z = -0.25,w = 0.25)
)
ggdag(my_first_dag) + 
  theme_dag()

Creating Your First DAG

my_first_dag <- ggdag::dagify(
  x ~ y,
  y ~ z + w
)

# ggdag(my_first_dag, layout = "time_ordered") + 
ggdag(my_first_dag, layout = "sugiyama") + 
  theme_dag()

Creating Your First DAG

my_first_dag <- ggdag::dagify(
  x ~ y,
  y ~ z + w,
  exposure = "x",
  outcome = "z"
)

ggdag(my_first_dag, layout = "time_ordered") +
  theme_dag()

Creating Your First DAG

library(ggplot2)
my_first_dag <- ggdag::dagify(
  x ~ y ,
  y ~ z + w,
  q ~ z + x,
  exposure = "x",
  outcome = "z"
)

ggdag(my_first_dag, layout = "time_ordered") +
  theme_dag() +
  theme(legend.position = "top")

# ggdag_paths(my_first_dag, layout = "time_ordered") +
#   theme_dag() + 
#   theme(legend.position = "top")
# ggdag_adjustment_set(my_first_dag) +
#   theme_dag() +
#   theme(legend.position = "top")

Creating Your First DAG

library(ggplot2)
my_first_dag <- ggdag::dagify(
  x ~ y,
  y ~ z + w,
  q ~ z + x,
  exposure = "x",
  outcome = "z"
)

# ggdag(my_first_dag, layout = "time_ordered") +
#   theme_dag() +
#   theme(legend.position = "top")
ggdag_paths(my_first_dag, layout = "time_ordered") +
  theme_dag() +
  theme(legend.position = "top")

# ggdag_adjustment_set(my_first_dag) +
#   theme_dag() +
#   theme(legend.position = "top")

Creating Your First DAG

library(ggplot2)
my_first_dag <- ggdag::dagify(
  x ~ y,
  y ~ z + w,
  q ~ z + x,
  exposure = "x",
  outcome = "z"
)

# ggdag(my_first_dag, layout = "time_ordered") +
#   theme_dag() +
#   theme(legend.position = "top")
# ggdag_paths(my_first_dag, layout = "time_ordered") +
#   theme_dag() + 
#   theme(legend.position = "top")
ggdag_adjustment_set(my_first_dag) +
  theme_dag() +
  theme(legend.position = "top")

Creating Your First DAG

library(ggplot2)
my_first_dag <- ggdag::dagify(
  z ~ x + y,
  y ~ x,
  exposure = "x",
  outcome = "z"
)

ggdag(my_first_dag) + theme_dag()
print(adjustmentSets(my_first_dag, effect = "direct"))
{ y }

Real DAGs

Smoking and Birth Weight

smoking <- dagify(
  weight ~ defect + smoking,
  mortality ~ defect + smoking + weight,
  outcome = "weight",
  exposure = "smoking"
)


ggdag(smoking, layout = "time_ordered",
      node_size = 20) + theme_dag()

Gender Pay Gap

gender <- dagify(
  occ ~ discrim + abil,
  income ~ discrim + abil + occ,
  discrim ~ gend,
  outcome = "discrim",
  exposure = "income"
)


ggdag(gender, layout = "time_ordered",
      node_size = 20) + theme_dag()

From: Causal Inference the Mixtape (Cunningham)

Gender Pay Gap

ggdag_paths(gender) +
  theme_dag()

Gender Pay Gap

# from: https://mixtape.scunning.com/03-directed_acyclical_graphs
library(tidyverse)

tb <- tibble(
  female = ifelse(runif(10000)>=0.5,1,0),
  ability = rnorm(10000),
  discrimination = female,
  occupation = 1 + 2*ability + 0*female - 2*discrimination + rnorm(10000),
  wage = 1 - 1*discrimination + 1*occupation + 2*ability + rnorm(10000) 
)

# open path d -> o -> i
lm_1 <- lm(wage ~ female, tb)
# open path d -> o <- a -> y
lm_2 <- lm(wage ~ female + occupation, tb)
# woo!
lm_3 <- lm(wage ~ female + occupation + ability, tb)

Gender Pay Gap

summary(lm_1)

Call:
lm(formula = wage ~ female, data = tb)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.726  -2.859  -0.055   2.916  14.153 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.95419    0.05926   32.98   <2e-16 ***
female      -3.02009    0.08459  -35.70   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.229 on 9998 degrees of freedom
Multiple R-squared:  0.1131,    Adjusted R-squared:  0.113 
F-statistic:  1275 on 1 and 9998 DF,  p-value: < 2.2e-16

Gender Pay Gap

summary(lm_2)

Call:
lm(formula = wage ~ female + occupation, data = tb)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.8292 -0.9117 -0.0010  0.9218  5.1408 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.196475   0.019963   9.842   <2e-16 ***
female      0.628497   0.029882  21.033   <2e-16 ***
occupation  1.813026   0.006156 294.535   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.359 on 9997 degrees of freedom
Multiple R-squared:  0.9084,    Adjusted R-squared:  0.9083 
F-statistic: 4.954e+04 on 2 and 9997 DF,  p-value: < 2.2e-16

Gender Pay Gap

summary(lm_3)

Call:
lm(formula = wage ~ female + occupation + ability, data = tb)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5080 -0.6892 -0.0075  0.6808  3.8340 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.00354    0.01727   58.10   <2e-16 ***
female      -1.00509    0.02856  -35.19   <2e-16 ***
occupation   1.00273    0.01004   99.84   <2e-16 ***
ability      2.01209    0.02222   90.57   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.008 on 9996 degrees of freedom
Multiple R-squared:  0.9497,    Adjusted R-squared:  0.9496 
F-statistic: 6.286e+04 on 3 and 9996 DF,  p-value: < 2.2e-16

Parent’s + Children’s Grades

Draw a DAG that looks at the relationship between Parent’s grades in high school and their children’s grades in high school. Include as many factors as you can think of!

Build this DAG using ggdag or dagitty and look at the possible adjustment sets!